Overview

So far, we’ve been using just the simple mean to make predictions. Today, we’ll continue using the simple mean to make predictions, but now in a complicated way. Before, when we calculated conditional means, we did so in certain “groupings” of variables. When we run linear regression, we no longer need to do so. Instead, linear regression allows us to calculate the conditional mean of the outcome at every value of the predictor. If the predictor takes on just a few values, then that’s the number of conditional means that will be calculated. If the predictor is continuous and takes on a large number of values, we’ll still be able to calculate the conditional mean at every one of those values.

The model we posit for regression is as follows:

\[Y=\beta_0+\beta_1 x_1 +\beta_2 x_2+ ... \beta_k x_k + \epsilon\]

It’s just a linear, additive model. Y increases or decreases as a function of x, with multiple x’s included. \(\epsilon\) is the extent to which an individual value is above or below the line created. \(\beta\) is the amount that \(Y\) increases or decreases for a one unit change in \(x\).

The Process

When working with regression models, there are a series of steps we will work through to build and test our models. There is some flexibility in the order they are done, but for now the easiest way to learn will be to approach them in the following order:

  1. Data Exploration
  2. Split the Data
  3. Begin a Workflow (Will discuss next week)
  4. Specify the Formula
  5. Specify the Model
  6. Set/Update Recipe (Will discuss next week)
  7. Assemble Workflow (Will discuss next week)
  8. Fit the Model to the Data
  9. Predict the Testing Data
  10. Calculate Fit Repeat Steps 3-7 as needed

Step 1: Data Exploration

The Data

load("area_data.Rdata")

ad<-as.data.frame(area_data)

This data comes from the American Community Survey of 2019. It covers all of the metro or micro statistical areas in the United States. It includes characteristics of these areas, include education, income, home ownership and others as described below.

Name Description
name Name of Micro/Metro Area
college_educ Percent of population with at least a bachelor’s degree
perc_commute_30p Percent of population with commute to work of 30 minutes or more
perc_insured Percent of population with health insurance
perc_homeown Percent of housing units owned by occupier
geoid Geographic FIPS Code (id)
income_75 Percent of population with income over 75,000
perc_moved_in Percent of population that moved from another state in last year
perc_in_labor force Percent of population in labor force
metro Metropolitan Area? Yes/No
state State Abbreviation
region Census Region
division Census Division

Dependent Variable

Our dependent variable will be the percent of the population that has a yearly income over $75,000. Let’s take a look at this variable.

ad%>%
  ggplot(aes(x=income_75))+
  geom_density()

This variable has a nice symmetric distribution. It looks approximately normal, which will help in interpreting the results.

Visualizing the Relationship - potential predictor (college_educ)

ad%>%
  ggplot(aes(x=college_educ, y=income_75)) +
  geom_point() + #produces scatterplot of % college ed by % income over 75
  geom_smooth(method="lm") #adds linear model line of best fit
## `geom_smooth()` using formula 'y ~ x'

#THINK about what each dot represents on this plot. 

It’s a good idea to plot the data before running a regression. The scatterplot above shows the percent of the population with incomes above 75,000 on the y axis and the percent of the population with a bachelor’s degree on the x axis. I always like to be able to identify the individual units in datasets like this. To do that I use the ggplotly command below.

An Interactive Approach
gg<-ad%>%
  ggplot(aes(x=college_educ, y=income_75, label=name))+
  geom_point()+
  geom_smooth(method="lm")

ggplotly(gg)
## `geom_smooth()` using formula 'y ~ x'

By using the plotly command, we can identify the individual units. Try it out, seeing which units have high or low incomes or higher or lower percentages of the population with a bachelor’s degree.

Note: if using plotly you cannot knit as PDF. You will need to knit as HTML.

Step 2: Split the Data

Testing and Training

The essence of prediction is discovering the extent to which our models generalize or predict outcomes for data that does not come from our sample. Many times this process is temporal. We fit a model to data from one time period, then take predictors from a subsequent time period to come up with a prediction in the future. For instance, we might use data on team performance to predict the likely winners and losers for upcoming soccer games.

This process does not have to be temporal. We can also have data that is out of sample because it hadn’t yet been collected when our first data was collected, or we can also have data that is out of sample because we designated it as out of sample.

The data that is used to generate our predictions is known as training data. The idea is that this is the data used to train our model, to let it know what the relationship is between our predictors and our outcome. So far, we have only worked with training data.

That data that is used to validate our predictions is known as testing data. With testing data, we take our trained model and see how good it is at predicting outcomes using out of sample data.

One very simple approach to this would be to cut our data in half. We could then train our model on half the data, then test it on the other half. This would tell us whether our measure of model fit (e.g. rmse) is similar or different when we apply our model to out of sample data. That’s what we’re going to do in this lesson. We’ll split the data randomly in half, with one half used to train our models and the other half used to test the model against out of sample data.

I use the initial_split command to designate the way that the data should be split– in this case in half (.5). The testing data (which is a random half of the original dataset) is stored as ad_test. The training data is stored as ad_train. We’ll run our models on ad_train.

Note: the set.seed command ensures that your random split should be the same as my random split.

set.seed(35202)

split_data<-ad%>%
  initial_split(prop=.5)

ad_train<-training(split_data)
ad_test<-testing(split_data)

Step 3: Begin a Workflow

We will discuss next week as an organizational framework for running multiple models to compare.

Step 4: Specify the Formula

To start training our model, we need to specify a formula. The dependent variable always goes on the left hand side, while the independent variable or variables always go on the right hand side. The formula below is for a model with the percent of the population with incomes above 75,000 on the left hand side, and the percent of the population with a college education on the right hand side.

income_formula<-as.formula("income_75 ~ college_educ")

Step 5: Specify the Model

The next step is to define the model. The type of model I want is a linear regression, which I specify below. This is also sometimes called “ordinary least squares” regression.

lm_fit <- 
  linear_reg() %>% 
  set_engine("lm")%>%
  set_mode("regression")

Step 6: Set/Update Recipe

We will discuss next week how to use recipes within the workflow framework.

Step 7: Assemble Workflow

Stay tuned for next week.

Step 8: Fit the Model to the Data

Now we’re ready to actually fit the model to the data. In the fit command below I specify which formula should be used and which dataset to fit the linear regression specified above.

lm_results<-
  lm_fit%>%
  fit(income_formula, data=ad_train) #we build/tweak our model in the TRAINING dataset

Look at the Results

To examine the results, we can use the summary command.

summary(lm_results$fit)
## 
## Call:
## stats::lm(formula = income_75 ~ college_educ, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.3046  -3.4308  -0.1221   3.3638  24.9082 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   19.1605     0.8057   23.78   <2e-16 ***
## college_educ   0.6160     0.0311   19.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.969 on 461 degrees of freedom
## Multiple R-squared:  0.4598, Adjusted R-squared:  0.4586 
## F-statistic: 392.3 on 1 and 461 DF,  p-value: < 2.2e-16

What this tells us is that for each additional percent of the population with a bachelor’s degree, the percent of the population with incomes above 75,000 is predicted to increase by 0.62

Step 9: Predict the Testing Data

Since we have fit the model to the training data and obtained parameter estimates, we are going to use those estimates to generate predicted values for our testing data. We are intentionally constraining the predicted values of the testing data as a means of checking generalizabiity. Since the data all came from the same “population” (the original dataset). The results calculated from one half should fit the other.

ad_test<-lm_results%>% ## start with the results
  predict(new_data=ad_test)%>% ## create a prediction
  rename(pred_educ=.pred)%>% ## rename the prediction- I'm using a more descriptive name than just "pred" here
  bind_cols(ad_test) ## add the prediction to the testing dataset

Now we have the prediction in our testing dataset, named pred_educ using the rename command. We can compare the actual value with the predicted value in the testing data using rmse.

Step 10: Calculate Fit

Calculate RMSE

rmse_1<-yardstick::rmse(ad_test,
     truth=income_75,
     estimate=pred_educ)

rmse_1
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        6.22

The rmse of 6.2 indicates how close our prediction is to the actual value in the testing dataset. This is important because it gives a sense of how the model we trained performs when trying to make predictions using new– or out of sample– data.

Quick Exercise Run a regression using a different predictor. Calculate rmse and see if you can beat my score.

Multiple regression

We can continue refining our model by adding more variables. Regression with more than one predictor is called multiple regression. The nice thing is that the steps will all be the same, we just need to change the formula and the rest stays pretty much as it was.

Step 1: Data Exploration

Typically, before running an MLR we want to check the correlations between our IVs(predictors) and our DV(outcome) as well as the correlations across our IVs. We WANT our IVs to be highly correlated with our DV (this indicates they might be good predictors, right?) but not highly correlated with each other.

  1. If an IV isn’t correlated to your DV, it will not be statistically significant in a regression model (without correlation, you can’t show prediction).
  2. If two (or more) IVs are TOO highly correlated, then you may run into issues with multicollinearity. Here’s a good/brief explanation of what that is and why it’s a problem.
library(sjPlot)
## Warning: package 'sjPlot' was built under R version 4.2.1
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
cormat<-tab_corr(ad%>%select(income_75, college_educ, perc_homeown), triangle = "lower", string.diag = c("1","1","1"))
cormat
  income_75 college_educ perc_homeown
income_75 1    
college_educ 0.695*** 1  
perc_homeown 0.074* -0.164*** 1
Computed correlation used pearson-method with listwise-deletion.

Above we see strong correlations between our IVs and our DV. Now, let’s look at the correlation between IVs. Here’s where we do NOT want super high correlations. We see the correlation between perc_homeown & college_educ is only about -.2 (although that is statistically significant at the .001 level, the actual correlation magnitude is fairly low).

Step 2: Split the Data

We already did this step, so we don’t need to do it again.

Step 3: Begin a Workflow

Next week.

Step 4: Specify the Formula

income_formula<-as.formula("income_75 ~ college_educ + perc_homeown")

Step 5: Specify the Model

lm_fit <- 
  linear_reg() %>% 
  set_engine("lm")%>%
  set_mode("regression")

Step 6: Set/Update Recipe

Next week.

Step 7: Assemble Workflow

Next week.

Step 8: Fit the Model to the Data

lm_results<-
  lm_fit%>%
  fit(income_formula, data=ad_train)

Look at the results

summary(lm_results$fit)
## 
## Call:
## stats::lm(formula = income_75 ~ college_educ + perc_homeown, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.764  -3.427  -0.247   3.089  25.496 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.80022    2.98639  -1.273    0.204    
## college_educ  0.66144    0.02975  22.234  < 2e-16 ***
## perc_homeown  0.31637    0.03981   7.948 1.48e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.603 on 460 degrees of freedom
## Multiple R-squared:  0.525,  Adjusted R-squared:  0.5229 
## F-statistic: 254.2 on 2 and 460 DF,  p-value: < 2.2e-16

Step 9: Predict the Testing Data

ad_test<-lm_results%>%
  predict(new_data=ad_test)%>%
  rename(pred_coll_hown=.pred)%>%
  bind_cols(ad_test)

Step 10: Calculate Fit

rmse_2<-yardstick::rmse(ad_test,
     truth=income_75,
     estimate=pred_coll_hown)

rmse_2
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        6.18

The rmse of 6.2 shows that we do get a more accurate prediction in the testing dataset using two predictors.

You MUST remember: correlation is not causation. All you can pick up on using this tool is associations, or common patterns. You can’t know whether one thing causes another. Remember that the left hand side variable could just as easily be on the right hand side.

Quick Exercise

Run a regression using two different predictors. Calculate rmse and see if you can beat my score.